-
We have implemented GPU-aware support across all AWP-ODC versions and enhanced message-passing collective communications for this memory-bound finite-difference solver. This provides cutting-edge communication support for production simulations on leadership-class computing facilities, including OLCF Frontier and TACC Vista. We achieved significant performance gains, reaching 37 sustained Petaflop/s and reducing time-to-solution by 17.2% using the GPU-aware feature on 8,192 Frontier nodes, or 65,536 MI250X GCDs. The AWP-ODC code has also been optimized for TACC Vista, an Arm-based NVIDIA GH200 Grace Hopper Superchip system, demonstrating excellent application performance. This poster will showcase these studies and GPU performance characteristics, and will discuss our verification of the GPU-aware development and the use of high-performance MVAPICH libraries, including on-the-fly compression, on modern GPU clusters.
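To make the communication pattern concrete, here is a minimal sketch of a GPU-aware halo exchange in Python with mpi4py and CuPy. It is not the AWP-ODC implementation (a compiled CUDA/HIP code); the buffer size and neighbor layout are illustrative, and it assumes an underlying GPU-aware MPI build (e.g. MVAPICH with GPU support) so device buffers can be handed to MPI directly instead of being staged through host memory.

```python
# Minimal sketch (assumptions noted above): exchange boundary data between
# neighboring ranks by passing GPU-resident arrays directly to MPI calls.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx = 1024                                               # illustrative halo width
halo_send = cp.full(nx, float(rank), dtype=cp.float32)  # boundary plane on the GPU
halo_recv = cp.empty(nx, dtype=cp.float32)

left, right = (rank - 1) % size, (rank + 1) % size

# With a GPU-aware MPI library, the device arrays are forwarded via their
# GPU array interface, avoiding an explicit device-to-host staging copy.
comm.Sendrecv(sendbuf=halo_send, dest=right,
              recvbuf=halo_recv, source=left)
cp.cuda.Device().synchronize()
```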
-
We present Radius, a gradient sparsity algorithm and system that accelerates large foundation model (FM) training while preserving downstream task performance. Radius leverages two key insights in large FM pre-training: 1) only a small portion of gradients contribute to the model updates in each iteration, and 2) the spatial distribution of large-magnitude gradients is stable over time. Radius overcomes the scaling problem of existing top-k sparsity methods because it maintains the structure of the sparse gradients and thus avoids dense communication. We examine the convergence and speed of Radius when pre-training GPT models (355M and 2.0B parameters) in a data-parallel setting and compare it with baseline top-k sparsification methods. Our results show that the existing top-k method with the AdamW optimizer fails to converge, and that its training speed improvement from sparse communication is marginal. In contrast, Radius with 40% sparsity reduces per-step training time by 21% (19% for overall training time) across 64 NVIDIA A100 GPUs connected by the Slingshot 11 interconnect while preserving downstream task performance.
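The following sketch illustrates the kind of structured gradient sparsity described above: a row-level mask chosen by gradient magnitude and held fixed across steps, so only the selected rows are all-reduced. The row granularity, refresh policy, and helper names are assumptions for illustration, not the paper's implementation.

```python
# Sketch of holding a structured (row-level) gradient mask fixed across steps
# and all-reducing only the selected rows; illustrative, not the Radius code.
import torch
import torch.distributed as dist

def refresh_mask(grad: torch.Tensor, density: float) -> torch.Tensor:
    """Pick the top `density` fraction of rows by gradient L2 norm."""
    k = max(1, int(density * grad.shape[0]))
    top_rows = torch.topk(grad.norm(dim=1), k).indices
    mask = torch.zeros(grad.shape[0], dtype=torch.bool, device=grad.device)
    mask[top_rows] = True
    return mask

def sparse_allreduce(grad: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average only the masked rows across ranks (process group assumed initialized).

    Because every rank uses the same row mask, the communicated tensor is dense
    over the selected rows and no index metadata needs to be exchanged.
    """
    selected = grad[mask].contiguous()
    dist.all_reduce(selected, op=dist.ReduceOp.SUM)
    selected /= dist.get_world_size()
    grad[mask] = selected
    grad[~mask] = 0.0
    return grad

# Typical loop: recompute the mask only occasionally and reuse it in between,
# so its layout (and the communication pattern) stays stable over time.
```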
-
We propose SLOPE, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces model accuracy; to overcome this, prior work uses dense models during fine-tuning. SLOPE improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% of pretraining iterations, without adding significant overhead to pretraining or inference. In addition, SLOPE uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLOPE accelerates the training and inference of models with billions of parameters by up to 1.25× and 1.54×, respectively (OPT-33B and OPT-66B), while reducing their memory usage by up to 0.63× and 0.61× for training and inference, respectively.
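Below is a small sketch of N:M magnitude pruning (here 2:4), the sparsity structure applied to the weight matrix and, in the double-pruned backward pass, to the transposed weights as well. This is an illustrative re-implementation using dense masking rather than true sparse kernels; the function name is an assumption.

```python
# Sketch of N:M (here 2:4) magnitude pruning; illustrative, dense masking only.
import torch

def prune_n_m(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude entries in every group of m along the last dim."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must be divisible by the group size m"
    groups = weight.abs().reshape(rows, cols // m, m)
    keep = torch.topk(groups, n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return weight * mask.reshape(rows, cols)

w = torch.randn(8, 16)
w_fwd = prune_n_m(w)                        # 2:4-sparse weight for the forward pass
w_bwd = prune_n_m(w_fwd.t().contiguous())   # transposed weight pruned again ("double-pruned")
```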
-
Despite the widespread exploration and availability of parcellations for the functional connectome, parcellations designed for the structural connectome are comparatively limited. Current research suggests that there may be no single “correct” parcellation and that the human brain is intrinsically a multiresolution entity. In this work, we propose the Continuous Structural Connectivity-based, Nested (CoCoNest) family of parcellations: a fully data-driven, multiresolution family of parcellations derived from structural connectome data. The CoCoNest family is created using agglomerative (bottom-up) clustering and error-complexity pruning, which strikes a balance between the complexity of each parcellation and how well it preserves patterns in vertex-level, high-resolution connectivity data. We draw on a comprehensive battery of internal and external evaluation metrics to show that the CoCoNest family is competitive with or outperforms widely used parcellations in the literature. Additionally, we show how the CoCoNest family can serve as an exploratory tool for researchers to investigate the multiresolution organization of the structural connectome.
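As a rough illustration of the bottom-up construction, the sketch below clusters synthetic vertex-level connectivity profiles with agglomerative (Ward) clustering. The data, the fixed number of parcels, and the linkage choice are stand-ins; the actual CoCoNest pipeline additionally applies error-complexity pruning to select nested resolutions from the merge tree.

```python
# Sketch: agglomerative clustering of synthetic connectivity profiles.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_vertices, n_targets = 500, 64
# Rows: cortical vertices; columns: structural connectivity to target regions.
profiles = rng.random((n_vertices, n_targets))

# A single bottom-up merge tree underlies all resolutions; here we simply cut
# it at 100 parcels instead of applying error-complexity pruning.
model = AgglomerativeClustering(n_clusters=100, linkage="ward")
labels = model.fit_predict(profiles)
print("first few parcel sizes:", np.bincount(labels)[:10])
```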
-
Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.
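The sketch below gives a highly simplified view of the idea: maintain Adam-style moments only on a random low-rank projection of the gradient and transfer the resulting per-channel scaling back to the full gradient. The class name, hyperparameters, channel granularity, and update form are illustrative assumptions, not the released APOLLO implementation.

```python
# Simplified sketch of low-rank, projection-based learning-rate scaling;
# illustrative assumptions throughout, not the released APOLLO optimizer.
import torch

class LowRankScaledGrad:
    def __init__(self, n_cols: int, rank: int = 8,
                 beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        self.P = torch.randn(n_cols, rank) / rank ** 0.5   # fixed random projection
        self.m = self.v = None                             # moments live in rank-r space
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def scaled_grad(self, grad: torch.Tensor) -> torch.Tensor:
        r = grad @ self.P                                   # project: (rows, rank)
        if self.m is None:
            self.m, self.v = torch.zeros_like(r), torch.zeros_like(r)
        self.m = self.beta1 * self.m + (1 - self.beta1) * r
        self.v = self.beta2 * self.v + (1 - self.beta2) * r * r
        update = self.m / (self.v.sqrt() + self.eps)
        # Per-row factor: how strongly the adaptive rule rescales this channel.
        scale = update.norm(dim=1) / (r.norm(dim=1) + self.eps)
        return grad * scale.unsqueeze(1)                    # structured LR update

# Usage: pass each weight matrix's gradient through scaled_grad() and apply a
# plain SGD-style step with the scaled gradient.
state = LowRankScaledGrad(n_cols=1024)
g = torch.randn(2048, 1024)
g_scaled = state.scaled_grad(g)
```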